Abstract. The use of patterns in predictive models is a topic that has 
received a lot of attention in recent years. Pattern mining can help to 
obtain models for structured domains, such as graphs and sequences, 
and has been proposed as a means to obtain more accurate and more 
interpretable models. Despite the large number of publications devoted 
to this topic, we believe that an overview of what has been 
accomplished in this area is missing. This paper presents our perspec- 
tive on this evolving area. We identify the principles of pattern mining 
that are important when mining patterns for models and provide an 
overview of pattern-based classification methods. We categorize these 
methods along the following dimensions: (1) whether they post-process a 
pre-computed set of patterns or iteratively execute pattern mining algo- 
rithms; (2) whether they select patterns model-independently or whether 
the pattern selection is guided by a model. We summarize the results that 
have been obtained for each of these methods. 



1 Introduction 

Classification and pattern mining are important problems in data mining and 
machine learning. In recent years an increasing number of publications have studied 
the combination of these problems. The main idea in these methods is that 
patterns can be used to define features or can be used as rules; classification 
models which make use of these features or rules may be more accurate or 
simpler to understand. Last but not least, in structured domains, pattern 
mining can be considered a propositionalization approach which enables the use 
of propositional data mining and machine learning algorithms. 

Despite the large number of publications devoted to this topic, we believe 
that an overview of what has been accomplished in this area is missing. 
It is not uncommon for publications in this area to refer to only a small portion of 
relevant related work, hence preventing deeper insight or a general theory from 
evolving. As an example, Kralj et al. pointed out that the problems of subgroup 
discovery, contrast set mining and emerging pattern mining are so similar that 
their main differences are arguably the terminology used [38]. We believe that 
this phenomenon is much more widespread. For instance, in this paper we will 
point out that the independently proposed areas of correlating (or correlated) 
itemset mining and discriminative itemset mining are also mostly identical to 
the problems studied in [38], except in name. 

The need to obtain better insight into the accomplishments of this area 
has been observed by other authors. In particular, this has led to a tutorial 
at ICDM'07 by Bailey and Dong [3] (and an extensive online reference list), a 
tutorial at ICDM'08 by Cheng et al. [14] and a workshop at ECML PKDD [35]. 
In this paper, we present our perspective on this area, which differs from earlier 
perspectives in several key aspects. 

Pattern Type Independence: Other overviews have stressed the fact that 
there are different types of data, such as graph-based, tree-based and itemset- 
based data. They coupled pattern selection strategies to particular pattern 
types, and stressed the fact that different pattern mining algorithms are 
needed to deal with each such data type. Even though this is true, and 
indeed one often needs to implement a different pattern miner to deal with 
a pattern type at hand, we believe that it is more important in this case to 
stress the conceptual similarities between these pattern mining algorithms. 
Doing so leads to the insight that most approaches that have been proposed 
for complex data types, such as graphs, can also easily be implemented in 
pattern mining algorithms for simpler data types, such as itemsets; this 
leads to a large number of additional approaches that itemset-mining based 
approaches could be compared with. 

Data Structure Independence: In a similar way, other tutorials have stressed 
the fact that even for the same data type, such as itemset data, different 
data structures may be used to speed up the computation of the patterns; 
examples are the FP-Trees [14] and ZBDDs [3]. Even though the choice 
for such data structures may have a significant impact on the efficiency of 
the computation, we believe that most pattern-based classification problems 
are orthogonal to the choice of such data structures: most solutions can be 
combined with any such data structure. 

Iterative Mining: Initial approaches which combined pattern mining and 
classification models took a strict stepwise approach, in which a set of patterns 
is computed once and these patterns are subsequently used in models. How- 
ever, in more recent years a large number of methods have been proposed 
which aim at integrating pattern mining, feature selection and model con- 
struction. In this paper we give a central position to such approaches. 

In Section 2 we present the key components of our proposed framework. The 
state-of-the-art of these components is discussed in more detail in subsequent 
sections. 

2 Overview 

The main idea of pattern-based classification is that patterns define new features, 
which can be used in a classification model. A simple example is provided in the 
figure below. 
Essentially, a pattern is a regularity that is observed in a number of examples; 
in our example, {A, B} is a pattern that occurs in the first and the third 
example. Whether this regularity is present or not in an example can be seen as 
a feature of each example. A prediction can be based on this, for instance, if an 
example includes items A and B we may predict the example to be positive. 
The key challenges in finding pattern-based models are: 

— how to find a set of patterns; 

— how to combine patterns into models. 
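
The pattern-as-feature idea described above can be made concrete with a
minimal sketch for itemset data. The toy examples below are hypothetical (they
are not the data of the original figure) and were chosen only so that {A, B}
occurs in the first and third example; the names `occurs` and `to_features`
are likewise illustrative.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]


def occurs(pattern: Itemset, example: Itemset) -> bool:
    """Matching operator for itemsets: a pattern occurs if it is a subset."""
    return pattern <= example


def to_features(examples: List[Itemset], patterns: List[Itemset]) -> List[List[int]]:
    """Build one binary feature per pattern: 1 if the pattern occurs, else 0."""
    return [[int(occurs(p, e)) for p in patterns] for e in examples]


# Hypothetical toy data: {A, B} occurs in the first and third example.
examples = [frozenset("ABC"), frozenset("BC"), frozenset("ABD"), frozenset("CD")]
patterns = [frozenset("AB"), frozenset("C")]

features = to_features(examples, patterns)
print(features)
```

Any propositional learner can then be trained on the rows of `features`; for
other pattern types only the matching operator `occurs` would need to change.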

We distinguish approaches in the literature along the following dimensions: 

Iterative Mining or Post-Processing: when a set of patterns is constructed, 
this can be done in two ways. We can run a pattern mining algorithm once to 
find a large set of patterns, and post-process its result to obtain a smaller set, 
or we can iteratively run a pattern mining algorithm, in each round finding a 
very small number of patterns (often only one), taking into account previous 
patterns in each round. 

Model-Dependence or Model-Independence: when we search for a set of 
patterns, we can use two types of criteria. We can use criteria that explicitly 
take into account the type of model in which the pattern will be used, or 
we can use criteria which are independent of the model; typically, in such 
a model-independent approach the aim is to find a set of patterns which is 
sufficiently diverse such that a more complex model, like an SVM, can be 
learned on the new features. 

Many approaches have been developed along each of these dimensions. We will 
provide an overview of these approaches in Sections 4.1, 5.1, 4.2 and 5.2. The 
following table clarifies how these sections correspond to these dimensions. 

                  Model-Independent   Model-Dependent 
Post-Processing   Section 4.1         Section 5.1 
Iterative         Section 4.2         Section 5.2 

Fig. 1. The overall process from pattern mining over feature selection to model 
induction. The dashed arrows show four possibilities of steering the process 
resulting from the two different dimensions we identified. 

The general search strategy for all these approaches can be summarized as 
in Figure 1. Starting from a given set of examples, the first step is to mine for 
patterns PS satisfying given constraints. In a subsequent, optional step a subset 
of these patterns is selected in order to optimize this set of patterns as features 
in a model. Finally, in a third step, those patterns are used as features to induce 
a model M. The process allows for several kinds of feedback. Each of 
the intermediate results can be evaluated with regard to its quality to derive 
constraints which can be used to guide the search for, or selection of, further 
patterns, or even to restart the mining or selection step with adjusted parameters. 
If this feedback involves explicit consultation of the induced model, we speak of 
model-dependent methods, otherwise of model-independent 
ones. Most approaches in the literature can be described with this process and 
use one of the four different types of self-steering, which we will 
discuss in the subsequent sections. 
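
As a rough illustration of the three steps (mine, select, induce), the
following sketch runs a single forward pass without any feedback. Everything
in it is an assumption made for illustration: the brute-force miner, the
selection score (absolute difference of class-wise supports), the
lookup-table model, and the toy data.

```python
from collections import Counter, defaultdict
from itertools import combinations


def mine(transactions, min_support):
    """Step 1: enumerate all itemsets meeting a minimum support constraint."""
    items = sorted(set().union(*transactions))
    result = []
    for size in range(1, len(items) + 1):
        level = [frozenset(c) for c in combinations(items, size)
                 if sum(frozenset(c) <= t for t in transactions) >= min_support]
        if not level:
            break  # anti-monotonicity: no larger itemset can be frequent
        result.extend(level)
    return result


def select(patterns, transactions, labels, k):
    """Step 2 (optional): keep the k patterns whose supports differ most by class."""
    def score(p):
        pos = sum(p <= t for t, y in zip(transactions, labels) if y == 1)
        neg = sum(p <= t for t, y in zip(transactions, labels) if y == 0)
        return abs(pos - neg)
    return sorted(patterns, key=score, reverse=True)[:k]


def induce(patterns, transactions, labels):
    """Step 3: a lookup-table model predicting the majority class per feature vector."""
    votes = defaultdict(Counter)
    for t, y in zip(transactions, labels):
        votes[tuple(p <= t for p in patterns)][y] += 1
    default = Counter(labels).most_common(1)[0][0]

    def model(t):
        key = tuple(p <= t for p in patterns)
        return votes[key].most_common(1)[0][0] if key in votes else default
    return model


# Hypothetical toy data: label 1 is the positive class.
transactions = [frozenset("ABC"), frozenset("BC"), frozenset("ABD"), frozenset("CD")]
labels = [1, 0, 1, 0]

mined = mine(transactions, min_support=2)
selected = select(mined, transactions, labels, k=2)
model = induce(selected, transactions, labels)
print(selected, model(frozenset("AB")), model(frozenset("C")))
```

The feedback arrows of Figure 1 would correspond to calling `mine` or `select`
again with constraints derived from evaluating `selected` or `model`.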

In all these approaches, an important problem is how to find one or more pat- 
terns, iteratively or not. The simplest approach is the post-processing approach 
which operates on frequent patterns. However, most approaches are more sophis- 
ticated and take the class attribute into account while mining patterns. Hence, 
an important question in all of these approaches, iterative, model-dependent or 
not, is how to find patterns that take a class attribute into account. We start 
with an overview of solutions to this problem in Section 3. 

3 Class-Sensitive Patterns 

The starting point for taking class labels into account is in all cases to compute 
the (possibly weighted) support of a pattern in all classes individually. One can 
distinguish the following approaches to using the class-specific supports in constraints: 

— support constraints per class, for instance, a minimum support constraint 
on one class combined with a maximum support constraint on another class. 
Such constraints can involve explicit thresholds, as for version space patterns 
[39], a minimum difference between support values for emerging patterns, or 
a maximum support of zero for an individual class as for jumping emerging 
patterns [41, 21]. 

— constraints on scores computed from supports, sometimes in addition to 
support constraints. Many alternative measures for correlation strength have 
been proposed, ranging from confidence, lift, weighted relative accuracy or 
novelty, to χ², the correlation coefficient, information gain, Fisher score and 
others, including measures derived from classification models, such as in 
gBoost [52]. 
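
To illustrate how such scores are derived from class-specific supports, the
sketch below computes the supports of an itemset in a two-class data set and
three of the measures named above using their standard textbook definitions.
The function names and the toy data are assumptions for illustration only.

```python
def class_supports(pattern, transactions, labels):
    """Return (p, n): the support of the pattern among positives and negatives."""
    p = sum(1 for t, y in zip(transactions, labels) if pattern <= t and y == 1)
    n = sum(1 for t, y in zip(transactions, labels) if pattern <= t and y == 0)
    return p, n


def confidence(p, n):
    """Fraction of covered examples that are positive."""
    return p / (p + n) if p + n else 0.0


def wracc(p, n, P, N):
    """Weighted relative accuracy: coverage times the gain in positive rate."""
    total, cov = P + N, p + n
    return (cov / total) * (p / cov - P / total) if cov else 0.0


def chi2(p, n, P, N):
    """Chi-square statistic of the 2x2 table (pattern present/absent vs. class)."""
    a, b, c, d = p, n, P - p, N - n
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return (P + N) * (a * d - b * c) ** 2 / denom if denom else 0.0


def is_jumping_emerging(p, n):
    """Jumping emerging pattern: present in one class, support zero in the other."""
    return p > 0 and n == 0


# Hypothetical toy data with P = 2 positive and N = 2 negative examples.
transactions = [frozenset("ABC"), frozenset("BC"), frozenset("ABD"), frozenset("CD")]
labels = [1, 0, 1, 0]
P, N = labels.count(1), labels.count(0)

p, n = class_supports(frozenset("AB"), transactions, labels)
print(p, n, confidence(p, n), wracc(p, n, P, N), chi2(p, n, P, N))
```

All measures are functions of the pair (p, n) only, which is why they can be
compared in a common coordinate system such as ROC space.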



Patterns satisfying constraints on derived scores have been called emerging 
patterns [18], subgroup descriptions [34,71,27], contrast sets [5], correlating patterns 
[48], discriminative patterns [15], and interesting rules [6,47]. In this case one 
may not be interested in finding all patterns satisfying the constraints. Instead, 
one may be interested in finding the top-k scoring patterns, or the top-k patterns 
per instance in the training data [67]. 

It was pointed out in [38] that contrast sets, emerging patterns and subgroups 
are compatible terms, in the sense that these terms serve the same purpose of 
denoting patterns that score high with respect to a scoring function that takes 
class labels into account. A similar observation can be made regarding correlated 
patterns, discriminative patterns and interesting patterns. In this paper we do 
not endeavour to choose one of these terms; instead, we will call 
such patterns class-sensitive patterns throughout this paper. Whether the 
community should agree on a common name, and which one this should be, is 
not an issue we wish to discuss here. 

In many cases threshold-based constraints are not effective enough to obtain 
smaller, non-redundant sets of patterns. One means to obtain smaller sets of 
patterns is to extend condensed representations, such as closed, free and non- 
derivable patterns [60,7,12], to the context of class-sensitive patterns [69,67, 
25]. 

Given the similarity in purpose of these patterns, it is not surprising that 
similar search strategies have been developed for each of them. Approaches that 
have been studied include post-processing frequent itemsets [2,33,15,43] (for 
subgroups, discriminative patterns, interesting patterns, emerging patterns, in 
some cases with an additional support threshold), branch-and-bound search [69, 
71,4,48,67,16,1,27,52] (for subgroups, correlated patterns, contrast sets, 
discriminative patterns, gBoost), or variations of iterative deepening [11,73,13] (for 
correlated patterns, discriminative patterns). The reason that branch-and-bound 
searches have been proposed for class-sensitive pattern mining is that finding 
such patterns has been proved to be computationally hard. Proofs can be found 
in [48,68]. 
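
A minimal sketch of such a branch-and-bound search for the top-k itemsets is
given below. It uses weighted relative accuracy as the score, together with a
simple optimistic estimate: a specialization can at best keep all p covered
positives and drop all covered negatives, so no specialization can score
higher than p/T · (1 − P/T), where T is the number of examples and P the
number of positives. The function names and the toy data are illustrative
assumptions, not taken from any of the cited systems.

```python
import heapq


def wracc(p, n, P, total):
    """Weighted relative accuracy of a pattern covering p positives, n negatives."""
    return p / total - (p + n) * P / total ** 2


def top_k_patterns(transactions, labels, k):
    """Depth-first branch-and-bound over itemsets; labels are 1 (pos) or 0 (neg)."""
    total, P = len(transactions), sum(labels)
    items = sorted(set().union(*transactions))
    top = []  # min-heap holding the k best (score, itemset) pairs found so far

    def threshold():
        return top[0][0] if len(top) == k else float("-inf")

    def expand(pattern, start, covered):
        for i in range(start, len(items)):
            new_cov = [j for j in covered if items[i] in transactions[j]]
            p = sum(labels[j] for j in new_cov)
            n = len(new_cov) - p
            candidate = pattern + (items[i],)
            score = wracc(p, n, P, total)
            if score > threshold():
                if len(top) < k:
                    heapq.heappush(top, (score, candidate))
                else:
                    heapq.heappushpop(top, (score, candidate))
            # Optimistic estimate: keep all p positives, lose all negatives.
            bound = p / total * (1 - P / total)
            if bound > threshold():
                expand(candidate, i + 1, new_cov)

    expand((), 0, list(range(total)))
    return sorted(top, reverse=True)


# Hypothetical toy data.
transactions = [frozenset("ABC"), frozenset("BC"), frozenset("ABD"), frozenset("CD")]
labels = [1, 0, 1, 0]

best = top_k_patterns(transactions, labels, k=1)
print(best)
```

With k = 1 the search on this data stops after the four single items, because
once {A} is found no branch has an optimistic estimate above its score.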

A main difference between the papers studying class-sensitive patterns is 
the choice of scoring function. For instance, weighted relative accuracy is 
commonly used in subgroup discovery, while χ² is common in correlated pattern 
mining. Insight into the differences between these measures can be obtained by 
comparing them in ROC space [22,51,50]. Among other things, such studies make it 
possible to compare how well the different measures can be bounded in a 
branch-and-bound search. Furthermore, such studies led to the insight that for some pattern 
domains (such as itemsets) better bounds exist than for other pattern domains 
[50]; however, all bounds introduced before [50] are pattern-domain independent, 
allowing for the application of existing strategies. 

Although most bounds are pattern-domain independent, the combination 
of such bounds with approaches for dealing with particular pattern domains has 
received significant attention. Pattern domains that have been studied are 
itemset or attribute-value data (including [15,48,5,6,21]), sequences [11,31], 
tree-structured patterns [76,30], and graphs (including [11,26,73,52]). Approaches 
for class-sensitive itemset mining have been implemented using optimized data 
structures such as FP-trees [28,16,2] or binary decision diagrams (BDDs) [44]. 
In the remainder of this paper, we will present approaches independent of the 
data type or data structure for which they were proposed, as most approaches 
are conceptually independent of this. 

An issue which has received limited attention is that of false positives. It is 
likely that in an exhaustive branch-and-bound search a pattern with a high score 
can be found, but this pattern may overfit the training data. How to control this 
error seems to be an open question; initial approaches suggest for instance to 
modify the scoring function [5,70]. 

4 Model-Independent Pattern Selection 

Mining class-sensitive patterns is usually the easy part, however. In many set- 
tings the number of patterns that is found is too large, in the sense that building 
classifiers on them is inefficient, overfitting is likely, and interpretability of the 
models may be hard. For patterns to be actually useful, there is the need to 
create a more compact set of effective patterns. If we do not take the subsequent 
model into account, the main aim of the pattern selection step is to reduce the 
redundancy of the pattern set. We can distinguish approaches which achieve this 
by post-processing an initial set of patterns, and approaches which iteratively 
search for patterns that increase the diversity of the pattern set. 

4.1 Model-independent post-processing 
Fig. 2. The model-independent post-processing approach: feedback is given by 
evaluation of the partially selected subset to steer feature selection. 

Post-processing the result of pattern mining according to certain criteria has 
two distinct advantages: first, both for finding the initial set of patterns and for 
reducing this set, we can use or adapt existing well-developed pattern mining 
techniques, which are usually rather efficient. Second, it is in many cases possible 
to explicitly control the properties of the resulting pattern set or at least to give 
guarantees about them. 

A variety of measures and constraints, and algorithms finding sets that satisfy 
them, have been proposed so far. In many cases, one is interested in finding a set 
of patterns that optimize a global criterion of diversity based on the occurrences 
of patterns in the data, sometimes in addition to explicit constraints. An example 
of a global criterion is entropy: if we select n patterns, we can encode every 
example in the data with a bit-vector of length n. This gives every bit-vector of 
length n a probability in the data. The entropy of this distribution can be used as 
a measure of diversity. Such diverse pattern sets can be searched for exhaustively [36, 
37,54]. In practice, these approaches do not scale well, and more greedy search 
strategies are needed. While the focus often is somewhat different, the general 
technique for selecting a subset of patterns by post-processing is very similar to 
filter approaches for feature selection. 
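
The entropy criterion described above can be sketched in a few lines: each
example is encoded as the tuple of pattern occurrences, and the entropy of the
empirical distribution over these tuples is returned. The function name and
the toy data are illustrative assumptions.

```python
from collections import Counter
from math import log2


def joint_entropy(patterns, transactions):
    """Entropy (in bits) of the bit-vectors that the patterns induce on the data."""
    codes = Counter(tuple(p <= t for p in patterns) for t in transactions)
    total = len(transactions)
    return -sum(c / total * log2(c / total) for c in codes.values())


# Hypothetical toy data.
transactions = [frozenset("ABC"), frozenset("BC"), frozenset("ABD"), frozenset("CD")]

diverse = joint_entropy([frozenset("AB"), frozenset("C")], transactions)
redundant = joint_entropy([frozenset("A"), frozenset("AB")], transactions)
print(diverse, redundant)
```

In this toy data {A} and {A, B} occur in exactly the same examples, so adding
one to the other does not increase the entropy, whereas {A, B} and {C}
partition the examples into three groups.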

Initial proposals for measures of diversity did not provide approximation 
guarantees for the quality of the pattern sets found [19]. However, more 
recently several pattern set criteria have been shown to be submodular, and 
consequently a greedy hill-climbing algorithm, which iteratively adds a highest 
scoring pattern to an initially empty set, achieves a result which approximates 
the optimum [24,61]. Other recent approaches attempt to reduce the computa- 
tional complexity of the pattern selection further, and study different measures 
for selecting patterns [19,9]. 
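
The greedy hill-climbing scheme can be sketched generically for any set
quality function. As a stand-in criterion this sketch uses coverage (the
number of examples in which at least one selected pattern occurs), which is a
standard example of a monotone submodular function; it is an assumption made
for illustration and not necessarily one of the criteria studied in the cited
work. For monotone submodular functions under a cardinality constraint, greedy
selection is known to reach at least a fraction 1 − 1/e of the optimum.

```python
def greedy_select(candidates, quality, k):
    """Start from the empty set and repeatedly add the pattern that helps most."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda c: quality(selected + [c]))
        if quality(selected + [best]) <= quality(selected):
            break  # no candidate improves the criterion any further
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data and candidate patterns.
transactions = [frozenset("ABC"), frozenset("BC"), frozenset("ABD"), frozenset("CD")]
candidates = [frozenset("A"), frozenset("AB"), frozenset("C"), frozenset("D")]


def coverage(pattern_set):
    """Number of transactions in which at least one pattern of the set occurs."""
    return sum(any(p <= t for p in pattern_set) for t in transactions)


chosen = greedy_select(candidates, coverage, k=2)
print(chosen, coverage(chosen))
```

Replacing `coverage` by another set function changes the criterion without
touching the selection loop.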

An alternative to data-only approaches is to take into account the 
similarity between the structures of the patterns themselves, rather than only 
their occurrences in the data. Here, too, one can define optimization criteria and 
greedy approaches for optimizing them [29,72]. 

A third set of approaches does not optimize a measure of diversity directly, 
but rather aims at finding a compact representation of the data; the idea here is 
to find a small set of patterns with which transactions can be encoded 
as accurately as possible with as few patterns as possible. One can distinguish 
the MDL-based approaches here [59,63], as well as the discrete basis problem 
[46,45]. 

Finally, machine learning-inspired sampling and verification techniques may 
also be used to obtain more diverse sets of patterns [10]. 



4.2 Model-independent iterative mining 
Fig. 3. The model-independent iterative approach: the feedback derived by evaluating 
the pattern set directly influences the mining of individual patterns. 

The alternative to feature selection lies in feature construction. In general, the 
idea here is not to generate patterns beforehand, but to search for patterns 
during the selection process. The main advantage is that we only find patterns 
that have meaning in the presence of other patterns. This is not necessarily 
the case in the setting described in the former section since the pattern mining 
operation itself does not take into account the relationships between patterns, 
and may produce many patterns which could have been pruned if the pattern 
search was more aware of the subsequent pattern selection. 

A first strategy is to adapt post-processing algorithms. Whereas greedy 
post-processing algorithms iteratively search for a pattern in a pre-computed set of 
patterns, this search for patterns can in some cases also be performed by a 
pattern mining algorithm. The main observation is that, given an already selected 
set of patterns, some scoring functions for measuring the diversity of a new 
pattern set are boundable. Hence we can use similar strategies to find new 
patterns as in class-sensitive pattern mining [57,61], which avoids having to 
pre-compute a set of patterns. 

An alternative approach lies in using a model-dependent iterative strategy (as 
discussed in Section 5.2); one can ignore the model produced by these strategies 
afterwards and use the patterns as features in other classification models [16]. 
The difference between model-dependent and -independent approaches is thus 
sometimes not as clear-cut as our terminology suggests. 



5 Model-Dependent Pattern Selection 

While all the methods described in the preceding section select patterns and sets of 
patterns using scoring functions, these scoring functions are not influenced by 
the choice of model that will be constructed from the patterns. Even though in 
model-independent approaches patterns may be used in SVMs, which show very 
good accuracy and guard against overfitting to a certain degree, the resulting 
models are difficult to interpret. The alternative is to use patterns directly to 
predict class labels, giving users the advantage of being able to examine and 
interpret the model. When doing this, it is often advantageous to adapt the 
scoring function to the model in which the patterns will be used. Here, too, we 
divide the existing methods into post-processing and iterative approaches. 



5.1 Model-dependent post-processing 
Fig. 4. The model-dependent post-processing approach: feedback is given by the model 
itself to steer feature selection. 

These approaches are also known as methods for associative classification. 
A variety of approaches have been proposed towards building a classifier from 
rules and performing predictions using selected patterns. 

The simplest such approaches post-process all patterns found in a previous 
phase; they rely on a conflict resolution strategy, similarly to unordered rule 
lists, which often means that in order to predict which class an example belongs 
to, a score is computed for each class from the patterns for that class. Many 
such scoring strategies have been proposed [19,67,41,75,56,1,63]. In some cases, 
such as [63], another approach on the border between model-dependence and 
-independence, model-independent pattern selection takes place before patterns 
are used in such a voting scheme. 
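
A voting scheme of this kind can be sketched as follows. Each rule is assumed
to be a triple of pattern, predicted class and weight; the weights used here
(for instance rule confidences) and the plain summation are illustrative
assumptions, since the cited approaches differ precisely in how the per-class
score is computed.

```python
from collections import defaultdict


def predict_by_voting(rules, example, default):
    """Sum the weights of all matching rules per class; predict the best class."""
    scores = defaultdict(float)
    for pattern, label, weight in rules:
        if pattern <= example:
            scores[label] += weight
    return max(scores, key=scores.get) if scores else default


# Hypothetical rule set: (pattern, class, weight).
rules = [
    (frozenset("AB"), "pos", 1.0),
    (frozenset("B"), "pos", 0.5),
    (frozenset("C"), "neg", 0.75),
]

print(predict_by_voting(rules, frozenset("ABC"), default="neg"))
print(predict_by_voting(rules, frozenset("CD"), default="pos"))
print(predict_by_voting(rules, frozenset("E"), default="neg"))
```

The first example matches rules of both classes, and the conflict is resolved
in favour of the class with the larger total weight; the last example matches
no rule and falls back to the default class.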



The alternative is to perform an ordered heuristic search over a set of pat- 
terns, guided by a database coverage constraint. In a sense this is a post- 
processing version of the sequential covering/weighted covering paradigm known 
in machine learning. Essentially, these approaches execute these steps: 

1. they sort the patterns; 

2. they select a pattern according to this sorting order; 

3. they optionally remove some of the remaining unselected patterns; 

4. they optionally re-sort remaining unselected patterns according to updated 
scores; 

5. they recursively continue selecting a pattern. 

Strategies implementing this idea have been studied in [43,42,76]. 
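
The five steps above can be sketched as a simple database-coverage loop. The
sorting key (confidence on the not yet covered examples, then the number of
correctly covered examples), the stopping rule and the toy data are
assumptions made for illustration; the cited methods differ in these details.

```python
def covering_selection(rules, transactions, labels):
    """Select (pattern, label) rules until all training examples are covered."""
    remaining = set(range(len(transactions)))  # indices of uncovered examples
    candidates, selected = list(rules), []

    def key(rule):
        pattern, label = rule
        covered = [i for i in remaining if pattern <= transactions[i]]
        correct = sum(labels[i] == label for i in covered)
        conf = correct / len(covered) if covered else 0.0
        return (conf, correct)

    while remaining and candidates:
        candidates.sort(key=key, reverse=True)   # steps 1 and 4: (re-)sort
        best = candidates.pop(0)                 # step 2: take the first rule
        if key(best)[1] == 0:
            break  # the best rule classifies no remaining example correctly
        selected.append(best)
        remaining -= {i for i in remaining if best[0] <= transactions[i]}
        # Step 3: drop rules that no longer cover any remaining example.
        candidates = [r for r in candidates
                      if any(r[0] <= transactions[i] for i in remaining)]
    return selected                              # step 5 is the loop itself


# Hypothetical toy data and candidate rules.
transactions = [frozenset("ABC"), frozenset("BC"), frozenset("ABD"), frozenset("CD")]
labels = ["pos", "neg", "pos", "neg"]
rules = [(frozenset("B"), "pos"), (frozenset("AB"), "pos"),
         (frozenset("C"), "neg"), (frozenset("D"), "neg")]

rule_list = covering_selection(rules, transactions, labels)
print(rule_list)
```

The result is an ordered rule list: a new example is classified by the first
rule in `rule_list` whose pattern it contains.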

While these approaches construct a model greedily, [49] showed that itemsets 
can be post-processed to construct a decision tree optimally. In this approach, an 
itemset corresponds to a path from the root to a leaf in a decision tree. Itemsets 
are selected from a set such that the resulting tree is optimal given user-specified 
constraints and criteria. 

5.2 Model-dependent iterative mining 
Fig. 5. The model-dependent iterative approach: feedback given by the model 
influences which patterns are mined next. 

In model-dependent, iterative mining techniques the connection to and in- 
spiration by machine learning becomes most obvious. Hence, these approaches 
are best understood as adaptations of machine learning techniques. We can dis- 
tinguish the following classification models. 

FOIL-like decision list learning strategies: these are techniques that can be 
understood as adaptations of the FOIL rule learning technique, combined 
with the weighted covering metaheuristic [74]. 

Decision tree learning strategies: these methods adapt decision tree induction 
algorithms such as C4.5 [8,26, 14]; they iteratively search for class-sensitive 
patterns that split data as well as possible according to criteria such as 
information gain; the search continues in parallel for the data sets resulting 
from the split. 

Instance-based learning strategies: these are methods where pattern mining is 
delayed till a test example is given; class-sensitive patterns are searched that 
are relevant for the test instance [65,64,40]. 

Boosting strategies: these are methods in which classifications of patterns are 
weighted, and rules are found by iteratively reweighting examples [52, 53]. 



Regression strategies: these are methods in which predictions are based on 
weighted sums of patterns, and weights of patterns are found by linear 
regression [58]. The boosting and regression methods often include a 
regularization parameter which needs to be set. In [62] it was studied how to find 
the regularization path, which in this case can be seen as an ordered set 
of patterns, each prefix of which corresponds to a regression model for one 
choice of this parameter. 
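
To illustrate the boosting strategy from the list above, the following sketch
uses plain AdaBoost with pattern-based decision stumps. This is a swapped-in
textbook technique chosen for brevity, not the algorithm of [52,53]. The stump
predicts `sign` if the pattern occurs and `-sign` otherwise. In an iterative
mining setting the inner search over `candidates` would be replaced by a
pattern mining call on the reweighted examples; here a fixed, hypothetical
candidate list stands in for it.

```python
from math import exp, log


def stump(pattern, sign, t):
    """Weak hypothesis: predict `sign` if the pattern occurs in t, else `-sign`."""
    return sign if pattern <= t else -sign


def boost(transactions, labels, candidates, rounds, eps=1e-10):
    """AdaBoost over pattern stumps; labels must be +1 or -1."""
    n = len(transactions)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        def weighted_error(rule):
            return sum(wi for wi, t, y in zip(w, transactions, labels)
                       if stump(rule[0], rule[1], t) != y)

        # Stand-in for the mining step: best (pattern, sign) under current weights.
        rule = min(((p, s) for p in candidates for s in (1, -1)), key=weighted_error)
        err = weighted_error(rule)
        if err >= 0.5:
            break  # no weak hypothesis better than chance
        alpha = 0.5 * log((1 - err + eps) / (err + eps))
        model.append((alpha, rule))
        if err == 0:
            break  # training data already classified perfectly
        # Reweight: misclassified examples gain weight, then normalize.
        w = [wi * exp(-alpha * y * stump(rule[0], rule[1], t))
             for wi, t, y in zip(w, transactions, labels)]
        z = sum(w)
        w = [wi / z for wi in w]
    return model


def predict(model, t):
    """Sign of the weighted vote of all stumps."""
    total = sum(alpha * stump(p, s, t) for alpha, (p, s) in model)
    return 1 if total >= 0 else -1


# Hypothetical toy data and candidate patterns.
transactions = [frozenset("ABC"), frozenset("BC"), frozenset("ABD"), frozenset("CD")]
labels = [1, -1, 1, -1]
candidates = [frozenset("A"), frozenset("B"), frozenset("C"), frozenset("D")]

model = boost(transactions, labels, candidates, rounds=5)
print([predict(model, t) for t in transactions])
```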

As pointed out, one can also choose to ignore the model constructed by any 
model-dependent strategy, and use the patterns as features in another type of 
model. The two categories therefore show different kinds of flexibility: while 
model-dependent results can be used both directly and as building blocks of 
another model, they are probably best suited to the model that was used to 
derive them, differing from the results of model-independent techniques. 

6 Conclusions 

In this paper we presented our perspective on the area of pattern-based 
classification. Key elements in our perspective are pattern type and data structure 
independence; rather than by pattern type or data structure, we propose to 
categorize approaches along two dimensions: whether they are model-dependent or 
model-independent, and whether they are iterative or non-iterative. 

For almost any quality measure and mining technique, neither the pattern 
language nor the language in which data are expressed is relevant for the 
pattern set selection phase, as long as there is a well-defined matching operator 
between the two. Furthermore, almost all techniques for mining class-sensitive 
patterns themselves are independent of these aspects as well, with the exception 
of data structures used. Such data structures, however, typically do not influence 
the applicability of mining techniques but only their implementation. This means 
that it is possible to transfer approaches freely between different representations 
and settings, albeit possibly at a certain cost of efficiency. 

Iterative approaches have the advantage of taking the effects of already 
selected patterns into account by adjusting the scoring function in some way. This 
makes it possible to focus on interesting areas of the pattern space, pruning subspaces that 
would have been explored in non-iterative mining, and visiting others that would 
have been ignored otherwise. The downside to this is that the space of potential 
solutions is far larger than in the non-iterative case, requiring the adoption of 
heuristic techniques and less control over, and looser guarantees for the quality 
of, resulting sets. Whereas early approaches were often model-dependent 
post-processing approaches, recent work focuses more on iterative approaches, both 
model-dependent and -independent. 

The trade-off involved in model-dependence and -independence has been 
sketched in the preceding section: the agnosticism of model-independence means 
that resulting sets can be expected to be useful to different kinds of modeling 
techniques instead of being tailored towards a particular model as in 
model-dependent solutions. In addition, while predictive models can be used in scoring 
functions, they do not exhaust the issue and therefore model-independent 
techniques can use measures that focus on different aspects of pattern relations and 
may eschew class labels completely. However, results that are produced by such 
approaches cannot be expected to be useful as direct predictors, making an addi- 
tional more or less complex modeling step necessary which will probably reduce 
interpretability. Model-dependent techniques, on the other hand, usually result 
in models in which the relationship among particular patterns and between pat- 
terns and prediction are far more easily accessible. In addition, resulting pattern 
sets can still be used as input to a different modeling step but might perform 
worse than pattern sets produced by model-independent approaches. 

A major issue in the current state-of-the-art is that so far it is not very clear 
to what degree the merits and drawbacks that can be derived analytically for 
different approaches materialize empirically. In most of the papers that proposed 
pattern-based classification algorithms, experiments were performed to show the 
benefits of the approaches. However, these comparisons were (understandably) 
often limited; they did not exhaustively consider all relevant comparable 
approaches that arise if one takes pattern-type independence into account 
and recognizes that graph-based approaches may also be used in simpler pattern 
domains; also, the number of data sets in most publications is limited and 
usually restricted to one data type. A few recent publications have presented more 
exhaustive experimental comparisons; [17,66] compared post-processing 
pattern-based classification with kernels and traditional approaches on a large number 
of molecular data sets, and obtained mixed results. Similarly, [32] compared 
exhaustive rule discovery strategies to greedy ones on UCI data sets. These results 
are necessary steps towards gaining better insight into the true relative merits of the 
many pattern-based classification strategies. 